
    Fusion of Multimodal Information in Music Content Analysis

    Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept, where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to apprehend music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In light of these case studies, some perspectives on multimodality in music processing are finally suggested.
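    The cross-modal correlation case study can be pictured with a small sketch. The abstract does not name a specific algorithm, so the snippet below uses canonical correlation analysis (a standard tool for relating two synchronized feature streams) on hypothetical per-frame audio and video feature matrices; all dimensions and feature choices are assumptions.

```python
# Hypothetical illustration of cross-modal correlation between audio and
# video feature streams using canonical correlation analysis (CCA).
# The feature matrices are stand-ins; the article does not prescribe this pipeline.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames = 500
audio_feats = rng.normal(size=(n_frames, 20))   # e.g. per-frame timbre features
video_feats = rng.normal(size=(n_frames, 32))   # e.g. per-frame motion features

cca = CCA(n_components=3)
audio_c, video_c = cca.fit_transform(audio_feats, video_feats)

# The correlation of each pair of canonical components measures how strongly
# the two modalities co-vary over time.
corrs = [np.corrcoef(audio_c[:, k], video_c[:, k])[0, 1] for k in range(3)]
print(corrs)
```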

    Pretext Tasks selection for multitask self-supervised speech representation learning

    Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The proposed method estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. Experiments conducted on automatic speech recognition, speaker recognition and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
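    As a rough illustration of weighting partial pretext-task losses, the sketch below combines several regression heads on a shared encoder with softmax-normalized learnable weights. It only shows the general shape of weighted multi-task self-supervised training; the paper's actual calibration procedure is not reproduced, and all module names, sizes and target dimensions are assumptions.

```python
# Minimal PyTorch sketch: a shared encoder with several pretext-task heads
# whose partial losses are combined through learnable weights. This is not
# the calibration method of the paper, only an illustration of the setup.
import torch
import torch.nn as nn

class MultiPretextModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, task_dims=(13, 1, 1)):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # One linear head per pretext task (e.g. MFCC, f0, energy targets).
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in task_dims)
        # Unnormalized task weights, mapped to a simplex via softmax.
        self.task_logits = nn.Parameter(torch.zeros(len(task_dims)))

    def forward(self, x, targets):
        h, _ = self.encoder(x)                       # (B, T, hidden)
        losses = torch.stack([
            nn.functional.l1_loss(head(h), tgt)
            for head, tgt in zip(self.heads, targets)
        ])
        weights = torch.softmax(self.task_logits, dim=0)
        return (weights * losses).sum(), losses.detach()

model = MultiPretextModel()
x = torch.randn(4, 100, 80)                          # a batch of feature frames
targets = [torch.randn(4, 100, d) for d in (13, 1, 1)]
total_loss, per_task_losses = model(x, targets)
total_loss.backward()
```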

    Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations

    Self-Supervised Learning (SSL) has made it possible to leverage large amounts of unlabeled speech data to improve the performance of speech recognition models, even with small annotated datasets. Despite this, speech SSL representations may fail when facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method designed for cases exhibiting such a mismatch in acoustic domains. It consists of applying properly calibrated data augmentations to a large clean dataset, bringing it closer to the target domain, and using it as part of an initial fine-tuning stage. Augmentations are automatically selected through the minimization of a conditional-dependence estimator, based on the target dataset. The approach is validated in an oracle experiment with controlled distortions and on two amateur-collected low-resource domains, reaching better performance than the baselines in both cases.
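    The abstract only names the selection criterion (a conditional-dependence estimator). The sketch below substitutes a much simpler proxy, the distance between feature statistics of augmented clean data and target-domain data, just to show the overall shape of an automatic selection loop; the augmentation functions, feature matrices and scoring rule are all placeholders rather than the paper's method.

```python
# Hypothetical selection loop: score each candidate augmentation by how close
# it brings clean-speech feature statistics to the target domain, then keep
# the best one. The paper minimizes a conditional-dependence estimator; here
# a simple mean/std distance stands in as a proxy.
import numpy as np

def feature_stats(batch):
    """Mean and std of a (n_utterances, n_features) matrix."""
    return np.concatenate([batch.mean(axis=0), batch.std(axis=0)])

def add_noise(batch, snr_db):
    scale = 10 ** (-snr_db / 20)
    return batch + scale * np.random.randn(*batch.shape)

def score_augmentation(augment, clean, target):
    """Lower is better: distance between augmented-clean and target statistics."""
    return np.linalg.norm(feature_stats(augment(clean)) - feature_stats(target))

# Toy feature matrices standing in for clean and target-domain utterances.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(200, 40))
target = rng.normal(0.3, 1.5, size=(200, 40))

candidates = {f"noise_snr_{snr}dB": (lambda b, s=snr: add_noise(b, s))
              for snr in (0, 5, 10, 20)}
best = min(candidates, key=lambda name: score_augmentation(candidates[name], clean, target))
print("selected augmentation:", best)
```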

    Exploring new features for music classification

    Automatic music classification aims at grouping unknown songs into predefined categories such as music genre or induced emotion. To obtain perceptually relevant results, appropriate features that carry important information for semantic inference must be designed. In this paper, we explore novel features and evaluate them on an automatic music tagging task. The proposed features span various aspects of the music: timbre, textual metadata, visual descriptors of cover art, and features characterizing the lyrics of sung music. The merit of these novel features is then evaluated using a classification system based on a boosting algorithm over binary decision trees. Their effectiveness for the task at hand is discussed with reference to the very common Mel-Frequency Cepstral Coefficient (MFCC) features. We show that some of these features alone bring useful information, and that the classification system takes great advantage of a description covering such diverse aspects of songs.
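    For the MFCC reference description, a minimal tagging/classification pipeline might look like the sketch below. File paths, labels and hyperparameters are placeholders, and the boosting configuration only approximates a boosted binary-decision-tree system; it is not the paper's exact setup.

```python
# Minimal sketch: classify songs from averaged MFCC features with a boosted
# ensemble of shallow decision trees. Paths, labels and hyperparameters are
# illustrative assumptions, not the paper's configuration.
import numpy as np
import librosa
from sklearn.ensemble import AdaBoostClassifier

def mfcc_vector(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    # Summarize the frame-level trajectory with its mean and std.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

audio_paths = ["song_rock.wav", "song_jazz.wav"]   # hypothetical files
labels = ["rock", "jazz"]

X = np.stack([mfcc_vector(p) for p in audio_paths])
# AdaBoost's default base learner is a depth-1 binary decision tree (a stump).
clf = AdaBoostClassifier(n_estimators=200, random_state=0)
clf.fit(X, labels)
print(clf.predict(X))
```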

    MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music

    We present MAD-EEG, a new, freely available dataset for studying EEG-based auditory attention decoding in the challenging case of subjects attending to a target instrument in polyphonic music. The dataset is the first music-related EEG dataset of its kind, enabling, in particular, studies on single-trial EEG-based attention decoding, while also opening the path to research on other EEG-based music analysis tasks. MAD-EEG so far comprises 20-channel EEG signals recorded from 8 subjects listening to solo, duo and trio music excerpts and attending to one pre-specified instrument. The proposed experimental setting differs from those previously considered, as the stimuli are polyphonic and are played to the subject through speakers instead of headphones. The stimuli were designed with variations in the number and type of instruments in the mixture, spatial rendering, music genre and melody. Preliminary results obtained with a state-of-the-art stimulus reconstruction algorithm commonly used for speech stimuli show that the audio representation reconstructed from the EEG response is more correlated with that of the attended source than with that of the unattended source, showing that the dataset is suitable for this kind of study.
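    Stimulus reconstruction of this kind is commonly implemented as a linear backward model: a regularized regression maps time-lagged EEG to an audio envelope, and attention is decoded by comparing correlations of the reconstruction with the attended and unattended envelopes. The sketch below follows that general recipe on synthetic data with assumed dimensions and regularization; it is not the exact algorithm or preprocessing used with MAD-EEG.

```python
# Sketch of a linear backward (stimulus-reconstruction) decoder: ridge
# regression from time-lagged EEG to an audio envelope, then attention is
# decided by which source envelope correlates more with the reconstruction.
# Data, lag count and regularization are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge

def lagged(eeg, n_lags):
    """Stack time-lagged copies of the EEG: (T, channels * n_lags)."""
    cols = [np.roll(eeg, lag, axis=0) for lag in range(n_lags)]
    return np.concatenate(cols, axis=1)

rng = np.random.default_rng(0)
T, channels, n_lags = 4000, 20, 16           # 20-channel EEG, a few hundred ms of lags
attended = rng.normal(size=T)                 # envelope of the attended instrument
unattended = rng.normal(size=T)               # envelope of the unattended instrument
eeg = 0.5 * attended[:, None] + rng.normal(size=(T, channels))

X = lagged(eeg, n_lags)
decoder = Ridge(alpha=1.0).fit(X, attended)   # trained on attended-stimulus data
reconstruction = decoder.predict(X)

r_att = np.corrcoef(reconstruction, attended)[0, 1]
r_unatt = np.corrcoef(reconstruction, unattended)[0, 1]
print("decoded:", "attended" if r_att > r_unatt else "unattended", r_att, r_unatt)
```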